The loan data from Prosper, an online lending platform that connects people who want to borrow money with individuals and institutions, is explored here. The loan data includes loan characteristics, Prosper's internal tracking data, borrower profile and some lender information.
The analysis will focus on
Theis question is of interest to potential lenders, who may not be familiar with loan pricing mechanism, or even to auditors and competitors.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="white")
# load dataset
df = pd.read_csv('prosperLoanData.csv')
df.head()
df.shape
df.info()
#converting all dates to datetime
df.ListingCreationDate = pd.to_datetime(df.ListingCreationDate)
df.LoanOriginationDate = pd.to_datetime(df.LoanOriginationDate)
df.DateCreditPulled = pd.to_datetime(df.DateCreditPulled)
df.ClosedDate = pd.to_datetime(df.ClosedDate)
# LoanStatus into ordered categorical type
loanstatus_order = ['Current', 'FinalPaymentInProgress', 'Completed', 'Past Due (1-15 days)',
'Past Due (16-30 days)', 'Past Due (31-60 days)', 'Past Due (61-90 days)',
'Past Due (91-120 days)', 'Past Due (>120 days)', 'Defaulted',
'Chargedoff']
ordered_var = pd.api.types.CategoricalDtype(ordered = True, categories = loanstatus_order)
df.LoanStatus = df.LoanStatus.astype(ordered_var)
Dictionary of variables mentions a change in loan data applicable for loans originated after July 2009; this includes calculations of Effective Yield, Estimated Loss and Effective Return, key variables to be analysed. Therefore, I will drop the loan listings that occure before July 2009. Before doing this, I will check the loan listing distribution accross time, to ensure there will be sufficient remaining data for the proposed analysis.
plt.figure(figsize=[15,5])
df.groupby([df.ListingCreationDate.dt.year,df.ListingCreationDate.dt.month])['ListingCreationDate'].count().plot(kind='bar')
plt.xlabel('Listing creation period (year, month)')
plt.ylabel('Count')
plt.title('Loan issuance distribution (2005-2014)');
There appears to be sufficient data for analysis for post July 2009 period. Will drop earlier observations (in line when first EstimatedReturn is recorded).
min_date = min(df[~df.EstimatedReturn.isnull()].ListingCreationDate)
min_date
df = df[df.ListingCreationDate>= min_date]
df.shape
As the dataset contains 81 variables, I will create a correlation matrix for the numeric variables, in order to make a preliminary decision as to which set of variables are likely to relate to variables of interest.
This will not be conclusive analysis, because, among others:
#lets check correlations of all numeric variables
plt.figure(figsize=[20,20])
#compute correlations
corr = df.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
# Plot heatmap
sns.heatmap(corr, cmap = 'RdBu', vmin = -1, vmax = 1, mask=mask, linewidths=0.5)
plt.title('Correlation table (numerical variables)');
There are 84,853 observations in the dataset (abreviated to include only post 13-July-2009 observations), where 13-July-2009 marks introduction of EstimatedReturn, EstimatedLoss, Estimated Yield, ProsperScore and other key variables to the analysis.
The dataset contains 81 features, of which majority are numerical (61), categorical (17) and boolean (3); some of the numerical variables are numerical categories or ratings (such as 'ProsperRating (numeric)', ListingCategory, ProsperScore).
The analysis will focus on
Theis question is of interest to potential lenders, who may not be familiar with loan pricing mechanism, or even to auditors and competitors.
The following are numerical variables with higher absolute correlation coefficients with estimated lender return, categorised:
Few variables, that are highly correlated such as :
Additional variables (non numerical) to be included in analysis:
Plot histograms of main return components, Estimated Effective Return, Yield and Loss.
columns = 3
fig, ax = plt.subplots(ncols=columns, figsize = [16,4])
variables = ['EstimatedReturn', 'EstimatedEffectiveYield', 'EstimatedLoss']
for j in range(columns):
var = variables[j]
ax[j].hist(df[var], bins=50)
ax[j].set_xlabel('{}'.format(var))
plt.suptitle('Histograms');
They appear to be range bound, in particular, some Returns and Yields fall into negative territory (expected because these are defaulted or chargedoff loans). Will check how many fall into this category.
df[df.EstimatedReturn < 0]['ListingKey'].count(), df[df.EstimatedEffectiveYield < 0]['ListingKey'].count()
Will replot the histograms excluding these cases, as out of >80,000 these are only ca. 200 cases.
columns = 3
fig, ax = plt.subplots(ncols=columns, figsize = [16,4])
variables = ['EstimatedReturn', 'EstimatedEffectiveYield', 'EstimatedLoss']
for j in range(columns):
var = variables[j]
ax[j].hist(df[var][df[var]>0], bins=50)
ax[j].set_xlabel('{}'.format(var))
plt.suptitle('Histograms');
Brief check how estiamted loss related to actual loss:
print(df.LP_NetPrincipalLoss[df.LP_NetPrincipalLoss == 0].count())
print(df.LP_NetPrincipalLoss[df.LP_NetPrincipalLoss != 0].count())
print(df.LP_NetPrincipalLoss[df.LP_NetPrincipalLoss != 0].count()/df.LP_NetPrincipalLoss.count())
columns = 2
fig, ax = plt.subplots(ncols=columns, figsize = [6*columns,4])
variables = ['LoanOriginalAmount', 'LP_NetPrincipalLoss']
for j in range(columns):
var = variables[j]
ax[j].hist(df[var][df[var]>0], bins=20)
ax[j].set_xlabel('{}'.format(var))
plt.suptitle('Histograms');
df['LoanLossRatio'] = df.LP_NetPrincipalLoss/df.LoanOriginalAmount
plt.figure(figsize = [6,4])
plt.hist(df.LoanLossRatio[(df.LoanLossRatio > 0)], bins=100)
plt.xlabel('Loan Loss Ratio')
plt.suptitle('Histogram');
Loan loss ratio is distributed betweet 0 and 1 (in contract to estimated losses, distributed between 0 and 0.3), and is skewed to the left.
Plot histograms of variables that relate to borrowers' current need of credit facilities (inquiries over last 6 months, bank card utilisation, available bank credit), and their current and historic utilisation/behavior wiht credit facilities (current delinquencies, delinquencies of the last 7 years).
columns = 5
fig, ax = plt.subplots(ncols=columns, figsize = [16,3])
variables = ['InquiriesLast6Months', 'CurrentDelinquencies', 'DelinquenciesLast7Years',
'BankcardUtilization', 'AvailableBankcardCredit']
for j in range(columns):
var = variables[j]
ax[j].hist(df[var], bins=50)
ax[j].set_xlabel('{}'.format(var))
plt.suptitle('Histograms');
Here replot the histograms with reduced bin size and excluding zero value on first row, and with log scale on second row. Log scale motivated by the fact that all variables except BankcardUtilization are heavily skewed to the right, and by plotting on log scale, will be better able to distinguish the shape of the distribution.
columns = 5
rows = 2
fig, ax = plt.subplots(nrows=rows, ncols=columns, figsize = [columns*4,rows*3.5])
variables = ['InquiriesLast6Months', 'CurrentDelinquencies', 'DelinquenciesLast7Years',
'BankcardUtilization', 'AvailableBankcardCredit']
for i in range(rows):
for j in range(columns):
var = variables[j]
step = max(df[var])/100
bins = np.arange(0,max(df[var])+step,step)
if i==1:
step = np.log10(df[var].max())/50
bins = 10 ** np.arange(0, np.log10(df[var].max())+step, step)
ax[i, j].set_xscale('log')
ax[i,j].hist(df[var][(df[var] > 0)], bins=bins)
ax[i, j].set_xlabel('{}'.format(var))
plt.suptitle('Histograms (regular scale and log scale)');
Plot histograms of credit score lower and upper range, examine the correlation between the two, and if highly correlated, merge into one variable.
columns = 2
fig, ax = plt.subplots(ncols=columns, figsize = [6*columns,4])
variables = ['CreditScoreRangeLower', 'CreditScoreRangeUpper']
for j in range(columns):
var = variables[j]
ax[j].hist(df[var], bins=20)
ax[j].set_xlabel('{}'.format(var))
plt.suptitle('Histograms');
sns.scatterplot(data=df, x='CreditScoreRangeLower', y='CreditScoreRangeUpper', alpha = 0.5)
plt.title("Credit Score Lower vs Upper Range Barrier")
plt.xlabel("Lower Barrier")
plt.ylabel("Upper Barrier");
The credit score lower and upper range have bell shaped distributions, slightly skewed to the right. The credit score lower and upper range appear to be perfectly correlated; I will therefore compute a new single variable Credit Score Mid, that is a mean of the two variables.
df['CreditScoreMid'] = (df.CreditScoreRangeUpper+df.CreditScoreRangeLower)/2
df.CreditScoreMid.mean()
plt.hist(df.CreditScoreMid, bins=20)
plt.xlabel('Credit Score (Mid)')
plt.suptitle('Histogram');
Plot histograms of variables that relate to borrowers' income profile and affordability of the loan, including Debt To Income Ratio, Stated Monthly Income, Monthly Loan Payment, Is Borrower Homeowner.
Below I plot side by side a histogram for full range of values on the left, and zooming into where majority of values for debt to income ratio are in the range of 0 to 2.
fig, ax=plt.subplots(ncols=2, figsize=(2*6,4))
bins = np.arange(-1,max(df.DebtToIncomeRatio),0.5)
ax[0].hist(df.DebtToIncomeRatio, bins=bins)
ax[0].set_xlabel('Debt To Income Ratio')
bins = np.arange(0,2.001,0.1)
ax[1].hist(df.DebtToIncomeRatio, bins=bins)
ax[1].set_xlabel('Debt To Income Ratio [0,2]')
plt.suptitle('Histograms');
Debt to income ratio distribution is bell shaped and skewed to the right.
Below plotting unadjusted histogram for stated monthly income.
plt.figure(figsize = [6,4])
bins = np.arange(-1,max(df.StatedMonthlyIncome),10000)
plt.hist(df.StatedMonthlyIncome, bins=bins)
plt.xlabel('Stated Monthly Income')
plt.suptitle('Histogram');
There appear to be some outliers (with stated income in the range from 50,000 to 1,750,000). I will check the count of such outliers.
df.StatedMonthlyIncome[df.StatedMonthlyIncome > 50000].count()
df.StatedMonthlyIncome.describe()
There are only very few of such outliers, with mean being 5,000. Reploting the histogram now withing 0 to 50,000 range.
bins = np.arange(0,50000,1000)
plt.hist(df.StatedMonthlyIncome, bins=bins)
plt.xlabel('Debt To Income Ratio [0,50k]')
plt.suptitle('Histogram');
Stated monthly income distribution is bell shaped and skewed to the right.
Below I plot side by side a histogram for full range of values on the left, and zooming into where majority of values for monthly loan payment in the range of 0 to 1000.
fig, ax=plt.subplots(ncols=2, figsize=(2*6,4))
ax[0].hist(df.MonthlyLoanPayment, bins=100)
ax[0].set_xlabel('Monthly Loan Payment')
bins = np.arange(0,1000, 50)
ax[1].hist(df.MonthlyLoanPayment, bins=bins)
ax[1].set_xticks(bins)
ax[1].set_xticklabels(bins,rotation = 90)
ax[1].set_xlabel('Monthly Loan Payment [0,1000]')
plt.suptitle('Histograms');
Monthly loan payment distribution is bell shaped and skewed to the right, also notable is relatively flat middle between USD 100 and USD 400m and spike near USD 200 monthly payments.
df.IsBorrowerHomeowner.value_counts().plot(kind='barh', figsize=[6,4])
plt.suptitle('Is borrower a home owner?');
There are more homeowners than non homeowners in the dataset.
df.Occupation.value_counts().head(50).plot(kind='bar', figsize = [12,4])
plt.xlabel('Occupation')
plt.suptitle('Occupation (count)');
There is a wide range of occupations stated especially highlighted by the fact that "other" is the largest catefory, and not one clearly defined occupation.
Plot histograms of loan properties, including Loan Status, Loan Amount, Term and Listing Category (numeric).
Plot all loan category counts on the left, and on the right plot all categories that certainly relate to non-performing loans.
fig, ax=plt.subplots(ncols=2, figsize=(2*6,4))
df['LoanStatus'].value_counts().sort_index().plot(kind='bar', ax=ax[0])
ax[0].set_xlabel('{}'.format('Loan Status'))
df['LoanStatus'].value_counts().sort_index()[3:].plot(kind='bar', ax=ax[1])
ax[1].set_xlabel('{}'.format('Loan Status (non-performing)'))
plt.suptitle('Loan Status (count)');
Most of loans are in Current and Completed category, also notable that there are over 5000 charged off loans.
Below I plot side by side a histogram for full range of values on the left, and increasing bin size (because it appears that loan amount is not continious, thus good visibility over distribution also seen with larger bins) for loan original amount on the right.
fig, ax=plt.subplots(ncols=2, figsize=(2*6,4))
ax[0].hist(df.LoanOriginalAmount, bins=100)
ax[0].set_xlabel('{}'.format('Original Loan Amount'))
bins = np.arange(0,35000,1000)
ax[1].hist(df.LoanOriginalAmount, bins=bins)
ax[1].set_xlabel('{}'.format('Original Loan Amount'))
plt.suptitle('Histograms');
The loan sizes are bell shaped, with notable spikes at around 5000, 10000, 15000, 20000 and 25000, and skewed to the right.
df.Term.value_counts().sort_index().plot(kind='bar', figsize=[6,4])
plt.xlabel('Loan term')
plt.suptitle('Loan term (count)');
There are only three types of terms, and with 36 month most popular, 60 month loans still numerous, far fewer 12 month term loans.
df['ListingCategory (numeric)'].value_counts().sort_index().plot(kind='bar', figsize = [6,4])
plt.xlabel('Loan Category - numeric (count)')
plt.suptitle('Loan Category - numeric');
Most loans are listed under 1- Debt Consolidation category, and other categories being far less frequent.
Plot distributions of loan counts accross time for date variables,including Listing Creation Date, Closed Date and Loan Origination Date.
fig, ax = plt.subplots(nrows=3, figsize=[12,12])
df.groupby([df.ListingCreationDate.dt.year,df.ListingCreationDate.dt.month])['ListingCreationDate'].count().plot(kind='bar',
ax=ax[0])
ax[0].set_xlabel('Listing creation period (year, month)')
df.groupby([df.LoanOriginationDate.dt.year,df.LoanOriginationDate.dt.month])['LoanOriginationDate'].count().plot(kind='bar',
ax=ax[1])
ax[1].set_xlabel('Loan origination period (year, month)')
df.groupby([df.ClosedDate.dt.year,df.ClosedDate.dt.month])['ClosedDate'].count().plot(kind='bar',
ax=ax[2])
ax[2].set_xlabel('Listing closed period (year, month)')
plt.subplots_adjust(hspace=0.5)
plt.suptitle('Loan distribution (2009-2014)');
Analyse the gap between loan origination date and listing creation date.
Plot histograms of the gap between the dates (which represents time during which loans would presumably have been warehoused by Prosper), on the left without ardjustment, and on the right on the log scale.
warehouse_range = (df.LoanOriginationDate - df.ListingCreationDate).dt.days
fig, ax=plt.subplots(ncols=2, figsize=(2*6,4))
ax[0].hist(warehouse_range, bins=100);
ax[0].set_xlabel('Warehouse Range (days)')
step = 0.25
bins = 10**np.arange(0, np.log10(warehouse_range.max())+step, step)
ax[1].hist(warehouse_range, bins=bins)
ax[1].set_xscale('log')
ax[1].set_xlabel('Warehouse Range (days), log scale')
plt.xticks([1,3,10,30,100,300], [1,3,10,30,100,300])
plt.suptitle('Histograms');
warehouse_range.describe()
Looks like mean loan warehousing day range is 7 days, with upper interquartile range being 12 days. For my analysis, which will focus on return/yield on loans progression over the years, this is not a significant gap, thus I will continue using Listing Creation Date as the main date variable.
Will examine the investment from friends - by checking the total number of loans that contain such investment, and, for those loans that do, the distribution of proportion invested by friends.
print(df.InvestmentFromFriendsAmount[df.InvestmentFromFriendsAmount == 0].count())
print(df.InvestmentFromFriendsAmount[df.InvestmentFromFriendsAmount > 0].count())
columns = 3
fig, ax = plt.subplots(ncols=columns, figsize = [6*columns,4])
df['FriendInvestmentRatio'] = df.InvestmentFromFriendsAmount / df.LoanOriginalAmount
variables = ['InvestmentFromFriendsAmount', 'LoanOriginalAmount', 'FriendInvestmentRatio']
for j in range(columns):
var = variables[j]
ax[j].hist(df[var][df[var]>0], bins=20)
ax[j].set_xlabel('{}'.format(var))
plt.suptitle('Histograms');
It appears that very few loans include investment by friends, and for those the proportion loan amount lent by friends is small, and highly skewed to the right.
Due to a large numberber of variables analysed, each variable's distribution is discussed separatelly near its individual chart above.
Large number of variables have a bell shaped distributions (including the dependent variables -- estimated returns, yield, loss -- and including independent variables of available bank credit (on log scale), borrower's credit score, debt to income ratio, stated monthly income and others), and majority have skewed distributions with long tails.
The data contains outliers, including:
I looked at several variables on the log scale, due to existence of the long tails in their distribution, and the following variables:
The main transformation of data was to reduce the size of the dataset, taking loans from July 2019 onwards, due to dependent variable only being available since this date, and at the same time, there being sufficient observations (84,853) remaining for the analysis.
Created a new variable CreditScoreMid, which is a mean of lower and upper range of credit score. Lower and upper range of credit score as these were highly positively correlated, thus combining them into one variable implies no loss of accurancy.
Analysed the gap between loan origination date and loan listing date, including use of computing the gap and log transforming it for plotting of its distribution, and concluded that the gap was not significant for the further investigation.
Noted loan sizes while bell shaped have notable spikes at around 5000, 10000, 15000, 20000 and 25000, and are skewed to the right, implying significantly more smaller loans issued than larger ones.
Bivariate exploration will focus on bivariate relationship between the dependent variables (Estimated Return, Yield and Loss) and each dependent variable.
Bivariate exploration will also include exploration of relationship between the three dependent variables (Estimated Return, Yield and Loss).
Furthermore, bivariate exploration will look at other relationships of interest between dependent and independent variables.
# First, will reduce the size of the dataframe analysed to a set containing variables of interest
df = df[['EstimatedReturn', 'EstimatedEffectiveYield', 'EstimatedLoss',
'LP_NetPrincipalLoss', 'LoanOriginalAmount', 'LoanLossRatio',
'InquiriesLast6Months', 'CurrentDelinquencies', 'DelinquenciesLast7Years',
'BankcardUtilization', 'AvailableBankcardCredit',
'CreditScoreMid',
'DebtToIncomeRatio', 'StatedMonthlyIncome', 'MonthlyLoanPayment', 'IsBorrowerHomeowner', 'Occupation',
'Term', 'ListingCategory (numeric)', 'LoanStatus',
'ListingCreationDate', 'LoanOriginationDate', 'ClosedDate',
'InvestmentFromFriendsAmount', 'FriendInvestmentRatio']]
df.columns
df.info()
#lets check correlations of all numeric variables in now reduced dataset
plt.figure(figsize=[20,20])
#compute correlations
corr = df.corr()
# Plot heatmap
sns.heatmap(corr, cmap = "RdBu", vmin = -1, vmax = 1, linewidths=0.5, annot = True, fmt = '.2f')
plt.title('Correlation table (numerical variables)');
From correlation matrix it's clear that that they are highly positively correlated (yield and return 0.80 corr coefficient, yield and loss 0.80 corr coefficient, return and loss, loss being smaller component than yield, 0.59 corr coefficient), and I know (from variable definitions) that estimated return should be estimated yield less estimated loss.
fig, ax = plt.subplots(ncols=2, figsize=[16,5])
sns.scatterplot(data=df, x='EstimatedReturn', y='EstimatedEffectiveYield',
hue = df.ListingCreationDate.dt.year, alpha = 0.5, ax=ax[0], legend='full')
ax[0].legend(loc='upper left')
ax[0].set_title('Estimated Yield vs Estimated Return (2009-2014)')
sns.scatterplot(data=df[(df.EstimatedReturn!=df.EstimatedEffectiveYield)&(df.EstimatedReturn>0)],
x='EstimatedReturn', y='EstimatedEffectiveYield', hue = df.ListingCreationDate.dt.year,
alpha = 0.5, ax=ax[1], legend='full')
ax[1].legend(loc='upper left')
ax[1].set_title('Estimated Yield (>0) vs Estimated Return (2010 post correction-2014)');
Note, since the left hand distribution, including all values, produced this interesting plot, I have added the listing year variable, which appears to fully explain the interesting shape of distribution on the left.
df[df.EstimatedReturn!=df.EstimatedEffectiveYield].ListingCreationDate.dt.year.value_counts()
df[df.EstimatedReturn==df.EstimatedEffectiveYield].ListingCreationDate.dt.year.value_counts()
On the left chart above, we can see, that for all 2009 loans and for large part of 2010 loans, estimated return did not include estimated loss (and thus estimated return was equal to estimated yield, producing perfect straigh line correlation).
Prosper seems to have corrected this error in computing Estimated Returns in late 2010.
Follwoing this, estimated return and estimated yield correlate positively, but not perfectly.
In addition, we can observe increasing steepness in "yearly" cohort of loans, which could be a result of increasing estimated losses (for like for like estimated yields) for each "yearly" cohort.
Below investigate relationship between estimated yield and loss.
fig, ax = plt.subplots(ncols=2, figsize=[16,5])
sns.scatterplot(data=df, x='EstimatedEffectiveYield', y='EstimatedLoss',
hue = df.ListingCreationDate.dt.year, alpha = 0.5, ax=ax[0], legend='full')
ax[0].legend(loc='upper left')
ax[0].set_title('Estimated Loss vs Estimated Return (2009-2014)')
sns.scatterplot(data=df[(df.EstimatedReturn!=df.EstimatedEffectiveYield)&(df.EstimatedReturn>0)],
x='EstimatedEffectiveYield', y='EstimatedLoss', hue = df.ListingCreationDate.dt.year,
alpha = 0.5, ax=ax[1], legend='full')
ax[1].legend(loc='upper left')
ax[1].set_title('Estimated Loss vs Estimated Return (2010 post correction-2014)');
df[df.EstimatedReturn!=df.EstimatedEffectiveYield][['EstimatedReturn','EstimatedEffectiveYield','EstimatedLoss']].corr()
Again, due to interesting pattern on the left (where a clear policy shift is clear), I added the listing year variable.
From above left chart it appears that in 2009-2010 the estimated loss whas independed of the estimated yield, and fell into several cohorts according to some other variable, following this the estimated loss became linearly related to estimated yield (achiving correlation of 0.96), and year after year the loss proportion appears to be slightly higher (as "yearly" curves shift parallely upwards).
Because:
in further investigations I will use estimated yield as proxy for investigating relationships with other variables, unless specific case required otherwise.
fig, ax = plt.subplots(ncols=2, figsize=[16,5])
sns.regplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedEffectiveYield,
fit_reg = True, x_jitter= 0.3, scatter_kws= {'alpha' : 1/50}, ax = ax[0])
ax[0].set_title('Estimated Yield (2009-2014)')
sns.regplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedLoss,
fit_reg = True, x_jitter= 0.3, scatter_kws= {'alpha' : 1/50}, ax = ax[1])
ax[1].set_title('Estimated Loss (2009-2014)');
fig, ax = plt.subplots(ncols=2, figsize=[16,5])
sns.pointplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedEffectiveYield, ci = 'sd',ax = ax[0])
sns.pointplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedLoss, ci = 'sd', ax = ax[1])
ax[0].set_title('Mean & Confidence Interval of Estimated Yield (2009-2014)')
ax[1].set_title('Mean & Confidence Interval of Estimated Loss (2009-2014)');
Estimated yield loss were lower in 2009 and 2010, then significant upward adjustment followed in 2011, following which, both have steadly declined, though the decline of estimated yield (from 2011 to 2014) was sharper than that of estimated loss in the same period.
The regplot charts above also illustrate that there were more loans listed/issued after 2011, and that they had what appears a higher density of distribution.
Further, it appears that all the negative yield loans are from 2009/2010 cohorts.
Since all negative yield loans appeared in limited range period (2009/2010), and were only 190 in number, in some of the following analysis, I exclude these observations.
df_postitive_yield=df[df.EstimatedEffectiveYield>0]
fig, ax = plt.subplots(ncols=2,figsize=[16,5])
del_bins = np.arange(0, 100 + 5, 5)
delinq7yr_bins = pd.cut(df.DelinquenciesLast7Years, del_bins, include_lowest = True)
sns.pointplot(x = delinq7yr_bins, y = df.EstimatedEffectiveYield, ci = 'sd',
linestyles='--', ax=ax[0])
ax[0].set_title('Estimated Yield vs Delinquncies(over last 7 years)')
inq_bins = np.arange(0, 20 + 2, 2)
inquiries_bins = pd.cut(df.InquiriesLast6Months, inq_bins, include_lowest = True)
sns.pointplot(x = inquiries_bins, y = df.EstimatedEffectiveYield, ci = 'sd',
linestyles='--', ax=ax[1])
ax[1].set_title('Estimated Yield vs Inquiries (over last 6 months)')
for ax in fig.axes:
plt.sca(ax)
plt.xticks(rotation=90);
More delinquncies (over the last 7 years) appear to result in only mildly higher yields, with exception of "step up" from [0-5] range to higher ranges. It's interesting to look closer at this, as I would expect a steeper positive correlation between delinquencies and estimated yield (and loss).
Greater number of inquiries appears to correlate with higher estimated yields, but the direction of this relationship starts to break down (and is flat) from 6 enquiries onwards, and is reversed after 16 enquiries, which is counterintuitive.
I will investigate both further in multivariate analysis.
plt.figure(figsize=[12,5])
sns.scatterplot(x=df_postitive_yield.AvailableBankcardCredit, y=df_postitive_yield.EstimatedEffectiveYield,
alpha = 0.03, color='steelblue')
plt.xscale('log')
plt.xticks([1,100,1000,10000,30000, 100000,1000000],[1,100,'1k','10k','30k','100k','1m'])
plt.title('Estimated Yield vs Available Bank Credit');
There appears to be negative correlation between available credit, that can be further investigated looking at available credit ranges.
sns.set(style='whitegrid')
plt.figure(figsize=[12,5])
AvailableBankCredit_bins = pd.cut(df_postitive_yield.AvailableBankcardCredit, [100,1000,10000,30000, 100000,1000000],
include_lowest = True)
sns.violinplot(x=AvailableBankCredit_bins, y=df_postitive_yield.EstimatedEffectiveYield, color='steelblue',
inner='quartile')
plt.title('Estimated Yield vs Available Bank Credit');
Above chart illustrates more clearly the negative correlation between estimated yield and available bank credit. This is intuitive, because either available credit indicates banks willingness to lend to a client or that the client hasn't used up it's credit.
The estimated yield is much lower between opposite spectrum of avaible credit ranges, where median yield for first bucket (USD 100-1000) is about 20% and median yield for last two buckets (USD 30k to 100k and USD 100k to 1m) is about 10% and 8% respectively.
plt.figure(figsize=[12,5])
sns.boxplot(x=df_postitive_yield.CreditScoreMid, y=df_postitive_yield.EstimatedEffectiveYield, color='steelblue')
plt.title('Estimated Yield vs Credit Score (mid)');
Credit score, as supplied by external credit agencies to Prosper, appears to have significant negative correlation with estimated yield. This is to say that higher credit score is related to lower loan yields.
sns.set(style='white')
fit, ax = plt.subplots(ncols=2, figsize=[16,5])
df_temp = df_postitive_yield[(df_postitive_yield.DebtToIncomeRatio>0)&(df_postitive_yield.DebtToIncomeRatio<1)]
sns.scatterplot(x=df_temp.DebtToIncomeRatio, y=df_temp.EstimatedEffectiveYield,
alpha = 0.03, color='steelblue', ax=ax[0])
DebtToIncome_bins = pd.cut(df_temp.DebtToIncomeRatio, [0,0.2,0.3,0.4,0.5,0.75,1],
include_lowest = True)
sns.violinplot(x=DebtToIncome_bins, y=df_temp.EstimatedEffectiveYield, color='steelblue',
inner='quartile')
plt.suptitle('Estimated Yield vs Debt to Income Ratio');
Debt to income ratio appears to have a positive correlation with estimated yield. This is to say that higher debt to income ratio ("higher indebtidness") leads to higher estimated yields. This relationship is muted for lower debt to income ratio buckets, but emerges more clearly as debt to income ration exceeds 0.3, and is especially pronounced when it moves above 0.75.
fit, ax = plt.subplots(ncols=2, figsize=[16,5])
df_temp = df_postitive_yield[(df_postitive_yield.StatedMonthlyIncome>0)&(df_postitive_yield.StatedMonthlyIncome<40000)]
sns.scatterplot(x=df_temp.StatedMonthlyIncome, y=df_temp.EstimatedEffectiveYield,
alpha = 0.03, color='steelblue', ax=ax[0])
StatedMonthlyIncome_bins = pd.cut(df_temp.StatedMonthlyIncome, [0,2500,5000,7500,10000,20000,40000],
include_lowest = True)
sns.violinplot(x=StatedMonthlyIncome_bins, y=df_temp.EstimatedEffectiveYield, color='steelblue',
inner='quartile')
plt.xticks(rotation=45)
plt.suptitle('Estimated Yield vs Stated Monthly Income');
Stated monthly income appears to have a negative correlation with estimated yield. This is to say that higher monthly income leads to lower estimated yields. This relationship is stronger as income rises from 0 to 7,500 a month, then relashionship weakens for monthly income exceeding 7,500.
sns.set(style='whitegrid')
plt.figure(figsize=[6,5])
sns.boxplot(x=df_postitive_yield.IsBorrowerHomeowner, y=df_postitive_yield.EstimatedEffectiveYield,
color='steelblue')
plt.title("Estimated Yield vs Borrowers' Home Ownership");
People owning homes appear to borrow at slighlty lower yiels (median and interquartile range boundries are lower), however, the overall range is about the same.
fix, ax=plt.subplots(ncols=2, figsize=[16,5])
occupation_lowyield = df_postitive_yield.groupby('Occupation')['EstimatedEffectiveYield'].mean().sort_values().head(10).index
occupation_highyield = df_postitive_yield.groupby('Occupation')['EstimatedEffectiveYield'].mean().sort_values().tail(10).index
df_temp_low = df_postitive_yield[df_postitive_yield.Occupation.isin(occupation_lowyield)]
df_temp_high = df_postitive_yield[df_postitive_yield.Occupation.isin(occupation_highyield)]
sns.boxplot(x=df_temp_low.Occupation, y=df_temp_low.EstimatedEffectiveYield,
color='steelblue', order=occupation_lowyield, ax=ax[0])
ax[0].set_xticklabels(occupation_lowyield, rotation=90)
ax[0].set_title('Estimated Yield vs Occupation (10 with lowest yields)')
sns.boxplot(x=df_temp_high.Occupation, y=df_temp_high.EstimatedEffectiveYield,
color='salmon', order=occupation_highyield, ax=ax[1])
ax[1].set_xticklabels(occupation_highyield, rotation=90)
ax[1].set_title('Estimated Yield vs Occupation (10 with highest yields)');
Looking at occupations related to lowest average yields, and occupations related to highers average yields, there does appear to be a relationship between these variables.
Typically higher earning, requiring extensive education, more "elite" professions appear to be related to lower yields (though range is high), and professions requiring lower education, likely to be more temporary basis appear to be related with higher yields.
Note, even a technical student (though as yet not earning) appears to have access to low yield loan.
Below I test how these occupations related to stated monthly income.
fix, ax=plt.subplots(ncols=2, figsize=[16,5])
mask_low = df_temp_low.StatedMonthlyIncome<30000
mask_high = df_temp_high.StatedMonthlyIncome<30000
sns.boxplot(x=df_temp_low[mask_low].Occupation, y=df_temp_low[mask_low].StatedMonthlyIncome,
color='steelblue', order=occupation_lowyield, ax=ax[0])
ax[0].set_xticklabels(occupation_lowyield, rotation=90)
ax[0].set_title('Stated Monthly Income vs Occupation (10 with lowest yields)')
sns.boxplot(x=df_temp_high[mask_high].Occupation, y=df_temp_high[mask_high].StatedMonthlyIncome,
color='salmon', order=occupation_highyield, ax=ax[1])
ax[1].set_xticklabels(occupation_highyield, rotation=90)
ax[1].set_title('Stated Monthly Income vs Occupation (10 with highest yields)');
Though occupations commanding lowest average yields do not uniformly command very high monthly incomes, the incomes of occupations having access to low yields are clearly higher (and 3/4 of instances earn above USD 5,000) than of those occupations commanding highest average yields (who without exception have 3/4 of instances earn below 5,000).
sns.set(style='white')
fit, ax = plt.subplots(ncols=2, figsize=[16,5])
sns.scatterplot(x=df_postitive_yield.LoanOriginalAmount, y=df_postitive_yield.EstimatedEffectiveYield,
alpha = 0.05, color='steelblue', ax=ax[0])
OriginalLoanAmount_bins = pd.cut(df_postitive_yield.LoanOriginalAmount, [0,5100,10100,15100,
20100,25100,30100,35100], include_lowest = True)
sns.boxplot(x=OriginalLoanAmount_bins, y=df_postitive_yield.EstimatedEffectiveYield, color='steelblue', ax=ax[1])
plt.xticks(rotation=45)
plt.suptitle('Estimated Yield vs Loan Original Amount');
As previously observed, there are "concentrations" around loand at 5k, 10k, 15k, 20k, 25k, 30k, 35k level.
Furthermore, there appear to be visibly less loans in general that are 20k and larger.
The small loans appear to have a large range in yields, and that yield range decreases as loan sizes increase.
There appear to be no large loans issued above a certain yield level. This is something to be investigated below and in multivariate analysis. Hypothesis is that large loans are also only given to high income, high credit score borrowers, hence the yields are lower.
fig, ax = plt.subplots(ncols=2, figsize=[16,5])
loan_bins = [0,5100,10100,15100,20100,25100,30100,35100]
creditscore_bins = np.arange(601,901,25)
h = ax[0].hist2d(x=df_postitive_yield.LoanOriginalAmount, y=df_postitive_yield.CreditScoreMid,
bins=[loan_bins,creditscore_bins], cmap="Blues", cmin = 100)
plt.colorbar(h[3], ax=ax[0])
OriginalLoanAmount_bins = pd.cut(df_postitive_yield.LoanOriginalAmount, bins=loan_bins, include_lowest = True)
sns.boxplot(x=OriginalLoanAmount_bins, y=df_postitive_yield.CreditScoreMid, color='steelblue', ax=ax[1])
plt.xticks(rotation=45)
plt.suptitle('Credit Score vs Loan Original Amount');
Above charts illustrate that more loans are issued of lower overall amounts, and that there is very high frequncy of smaller loans to borrowers with credit scores ranaging from 600 to 750.
Then it appears that as loan amounts increase, the borrower's credit ratings also increase, to the extent that loans above USD 25,000 are not issued to anyone with credit score lower than about 725.
sns.set(style='whitegrid')
fig, ax = plt.subplots(ncols=2, figsize=[16,5])
sns.boxplot(x=df.LoanStatus, y=df.EstimatedLoss, color='steelblue', ax=ax[0])
sns.boxplot(x=df.LoanStatus, y=df.LoanLossRatio, color='steelblue', ax=ax[1])
ax[0].set_title('Estimated Loss vs Loan Status')
ax[1].set_title('Actual Loss Ratio vs Loan Status')
for ax in fig.axes:
plt.sca(ax)
plt.xticks(rotation=90);
Overall, it appears that current laons, and loans with final payment in progress, have lower estimated losses, than those that are past due or indeed defaulted or chargedoff.
Nonetheless, actual loss ratio, which occurs only in defaulted and chargedoff loans (as indicated by chart above right), is far higher, with median reaching about 80% and 90% respectively of the original amount of the loan.
Analysis reveals that:
I observed, that lower yields (at the edge) link to occupations that are considered as more prestigious and require greater levels of education, and opposite in that higher yields link to occupations that require less education, are often part time or temporary contracts. Indeed, those same occupations respectively are associated with higher stated monthly income (commanding lower yields) and with lower stated monthly income (commanding higher yields). Interestingly, even a student(s) in technology (who have lower incomes) still are associated with lower yielding loans.
As well I obserbed that as loan amounts increase, the borrower's credit ratings also increase, to the extent that loans above USD 25,000 appear to not have been issued to anyone with credit score lower than about 725.
Another interesting observation, is that estimated losses are broadly increasing as loans are non-performing for longer (as expected), but the expected loss ratio (its interquartile range) hovers below 15%; while when loans actually default or are chargedoff, the loss ratio median is 80% (defaulted) to 90% (chargedoff).
The multivariate exploration will by and large be focused on introducing time variable into earlier analysis, to investigate trends of estimated loan yields in time, and include other relevant analysis.
sns.set(style='white')
cred_bins = np.arange(601,901,50)
creditscore_bins = pd.cut(df.CreditScoreMid, cred_bins, include_lowest = True)
creditscore_index= creditscore_bins.value_counts().sort_index().index
fig, ax = plt.subplots(ncols=2, figsize=[16,5])
palette = sns.color_palette("Blues")
sns.pointplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedEffectiveYield,
hue = creditscore_bins,
ax = ax[0], palette=palette)
ax[0].set_title('Estimated Yield vs Loan Listing Date vs Credit Score')
sns.pointplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedLoss,
hue = creditscore_bins,
ax = ax[1], palette=palette)
ax[1].set_title('Estimated Loss vs Loan Listing Date vs Credit Score');
Above charts illustrate that:
status_list = ['Current', 'Completed', 'Defaulted', 'Chargedoff']
df_small = df[df.LoanStatus.isin(status_list)]
fig, ax = plt.subplots(ncols=2,figsize=[18,5])
colors = ['pure blue', 'light grey', 'baby blue', 'light grey', 'light grey',
'light grey', 'light grey', 'light grey', 'light grey',
'lipstick', 'rosa']
palette = sns.xkcd_palette(colors)
del_bins = np.arange(0, 100 + 5, 5)
delinq7yr_bins = pd.cut(df_small.DelinquenciesLast7Years, del_bins, include_lowest = True)
sns.pointplot(x = delinq7yr_bins, y = df_small.EstimatedEffectiveYield, hue=df_small.LoanStatus,
ci = 'sd', linestyles='', ax=ax[0], dodge=True, palette=palette)
ax[0].legend(ncol=3,loc='lower left', fontsize=8)
ax[0].set_title('Estimated Yield vs Delinquncies vs Loan Status')
inq_bins = np.arange(0, 20 + 2, 2)
inquiries_bins = pd.cut(df_small.InquiriesLast6Months, inq_bins, include_lowest = True)
sns.pointplot(x = inquiries_bins, y = df_small.EstimatedEffectiveYield, hue=df_small.LoanStatus,
ci = 'sd', linestyles='', ax=ax[1], dodge=True, palette=palette)
ax[1].legend(ncol=3,loc='lower left', fontsize=8)
ax[1].set_title('Estimated Yield vs Inquiries vs Loan Status')
for ax in fig.axes:
plt.sca(ax)
plt.xticks(rotation=90);
Above charts illustrate that:
I conclude that it is possible that for current loans, historic delinquencies are not taken into account in pricing, while number of inquiries are more likely to be included.
I will look below also at the same two variables, and how they have correlated with estimated yields in time.
fig, ax = plt.subplots(ncols=2,figsize=[16,5])
palette = sns.color_palette("Blues")
del_bins = np.arange(0, 100 + 5, 5)
delinq7yr_bins = pd.cut(df.DelinquenciesLast7Years, del_bins, include_lowest = True)
sns.pointplot(x = delinq7yr_bins, y = df.EstimatedEffectiveYield,
hue = df.ListingCreationDate.dt.year,
ax=ax[0], palette=palette)
ax[0].set_title('Estimated Yield vs Delinquncies vs Loan Listing Date')
inq_bins = np.arange(0, 20 + 2, 2)
inquiries_bins = pd.cut(df.InquiriesLast6Months, inq_bins, include_lowest = True)
sns.pointplot(x = inquiries_bins, y = df.EstimatedEffectiveYield,
hue = df.ListingCreationDate.dt.year,
ax=ax[1], palette=palette)
ax[1].set_title('Estimated Yield vs Inquiries vs Loan Listing Date')
for ax in fig.axes:
plt.sca(ax)
plt.xticks(rotation=90);
del_bins = np.arange(0, 100 + 5, 5)
inq_bins = np.arange(0, 20 + 2, 2)
cred_bins = np.arange(601,901,50)
creditscore_bins = pd.cut(df.CreditScoreMid, cred_bins, include_lowest = True)
creditscore_index= creditscore_bins.value_counts().sort_index().index
fig, ax = plt.subplots(nrows =len(creditscore_index), ncols=2,figsize=[16,30])
palette = sns.color_palette("Blues")
row = 0
for credit_score in creditscore_index:
df_credit = df[creditscore_bins == credit_score]
delinq7yr_bins = pd.cut(df_credit.DelinquenciesLast7Years, del_bins, include_lowest = True)
sns.pointplot(x = delinq7yr_bins, y = df_credit.EstimatedEffectiveYield, hue = df_credit.ListingCreationDate.dt.year,
ax=ax[row,0], palette=palette, ci=None)
ax[row,0].title.set_text('Credit score ={}'.format(credit_score))
inquiries_bins = pd.cut(df_credit.InquiriesLast6Months, inq_bins, include_lowest = True)
sns.pointplot(x = inquiries_bins, y = df_credit.EstimatedEffectiveYield, hue = df_credit.ListingCreationDate.dt.year,
ax=ax[row,1], palette=palette, ci=None)
ax[row,1].title.set_text('Credit score ={}'.format(credit_score))
row +=1
plt.subplots_adjust(hspace=0.5)
for ax in fig.axes:
plt.sca(ax)
plt.xticks(rotation=45);
Above charts indicate that historic delinquncies have not a relationship with estimated yields, and this is the case for each year.
On the other hand, inquiries (over last 6 months) seem to have a positive correlation with estimated yields each year, except for 2009 and 2010. Thus I believe this measure could have been intruduced post 2010.
Its also interesting, how for higher credit scores (all but 600-650), there seems to have a considerable yield contraction after 2013, for the same number of inquiries.
fig, ax = plt.subplots(ncols=2,figsize=[16,5])
palette = sns.cubehelix_palette(8)
ABC_bins = [100,1000,10000,30000, 100000,1000000]
AvailableBankCredit_bins = pd.cut(df.AvailableBankcardCredit, ABC_bins,
include_lowest = True)
sns.pointplot(x = AvailableBankCredit_bins, y = df.EstimatedEffectiveYield,
hue = df.ListingCreationDate.dt.year,
ax=ax[0], palette=palette)
ax[0].set_title('Estimated Yield vs Available Bank Credit vs Loan Listing Date')
DTI_bins = [0,0.2,0.3,0.4,0.5,0.75,1]
DebtToIncome_bins = pd.cut(df.DebtToIncomeRatio, DTI_bins,
include_lowest = True)
sns.pointplot(x = DebtToIncome_bins, y = df.EstimatedEffectiveYield,
hue = df.ListingCreationDate.dt.year,
ax=ax[1], palette=palette)
ax[1].set_title('Estimated Yield vs Debt To Income Ratio vs Loan Listing Date')
for ax in fig.axes:
plt.sca(ax)
plt.xticks(rotation=45);
As with overall data, the negative relationship between available bank credit and estimated yield, and positive relationship between debt to income ratio and estimated yield are evident accross each year's loan cohort (except, in part, 2009-2010, presumably due to nagative rates).
It's also evident that with each laon cohort from 2011 to 2014, the estimated yields have decreased for the same "credit quality" indicator.
ABC_bins = [100,1000,10000,30000, 100000,1000000]
DTI_bins = [0,0.2,0.3,0.4,0.5,0.75,1]
cred_bins = np.arange(601,901,50)
creditscore_bins = pd.cut(df.CreditScoreMid, cred_bins, include_lowest = True)
creditscore_index= creditscore_bins.value_counts().sort_index().index
fig, ax = plt.subplots(nrows =len(creditscore_index), ncols=2,figsize=[16,30])
palette = sns.cubehelix_palette(8)
row = 0
for credit_score in creditscore_index:
df_credit = df[(creditscore_bins == credit_score)]
AvailableBankCredit_bins = pd.cut(df_credit.AvailableBankcardCredit, ABC_bins, include_lowest = True)
sns.pointplot(x = AvailableBankCredit_bins, y = df_credit.EstimatedEffectiveYield,
hue = df_credit.ListingCreationDate.dt.year,
ax=ax[row,0], palette=palette, ci=False)
ax[row,0].title.set_text('Credit score ={}'.format(credit_score))
DebtToIncome_bins = pd.cut(df_credit.DebtToIncomeRatio, DTI_bins, include_lowest = True)
sns.pointplot(x = DebtToIncome_bins, y = df_credit.EstimatedEffectiveYield,
hue = df_credit.ListingCreationDate.dt.year,
ax=ax[row,1], palette=palette, ci=False)
ax[row,1].title.set_text('Credit score ={}'.format(credit_score))
row +=1
plt.subplots_adjust(hspace=0.7)
for ax in fig.axes:
plt.sca(ax)
plt.xticks(rotation=45);
Above analysis further demonstrates that for low credit score, even "good credit" indicators (such as hogh available bank credit or low debt to income ratio) do not relate to low estimated yield; yet as credit score increases, those relationships become more ponounced.
I observed earlier that as loan amounts increase, the borrower's credit ratings also increase, to the extent that loans above USD 25,000 are not issued to anyone with credit score lower than about 725. It's interesting to look further into this.
years=df_postitive_yield.LoanOriginationDate.dt.year.value_counts().sort_index().index
loan_bins = [0,5100,10100,15100,20100,25100,35100]
creditscore_bins = np.arange(601,901,25)
creditscore_largebins = [600.001,701,901]
OriginalLoanAmount_bins = pd.cut(df_postitive_yield.LoanOriginalAmount, bins=loan_bins, include_lowest = True)
OriginalLoanAmount_index=OriginalLoanAmount_bins.value_counts().sort_index().index
rows = len(OriginalLoanAmount_index)
fig, ax = plt.subplots(nrows = rows, ncols=2, figsize=[16, rows*5])
row = 0
for loan_amount in OriginalLoanAmount_index:
mask = (OriginalLoanAmount_bins == loan_amount)
h = ax[row,0].hist2d(x=df_postitive_yield[mask].LoanOriginationDate.dt.year,
y=df_postitive_yield[mask].CreditScoreMid,
bins=[years,creditscore_bins], cmap="Blues", cmin = 50)
plt.colorbar(h[3], ax=ax[row,0])
ax[row,0].title.set_text('Loan amount range ={}'.format(loan_amount))
CreditScore_large_bins = pd.cut(df_postitive_yield[mask].CreditScoreMid, bins=creditscore_largebins,
include_lowest = True)
sns.boxplot(x=df_postitive_yield[mask].LoanOriginationDate.dt.year,
y=df_postitive_yield[mask].EstimatedEffectiveYield,
hue=CreditScore_large_bins,
palette='Blues', ax=ax[row,1])
ax[row,1].title.set_text('Loan amount range ={}'.format(loan_amount))
row +=1
plt.subplots_adjust(hspace=0.3);
From charts above, it appears that:
Analysis above shows that from 2011 to 2014, the estimated yields (and returns and losses) of loans have decreased each year (holding all risk factors constant), and that estimated returns would have decreased both as a result of reducing estimated yields, but also by slight increase in proportion applied to estimated loan losses (as proportion of yield).
Further, analysis highlighted that credit scores are particularly important when pricing loans, for example, it was evident that for low credit scores, loans will be associated with high yields, despite high available bank credit or low debt to income ratio. Credit score influence was throughout significant differentiator for loan pricing, including in loan sizes.
It's evident from the analysis that Prosper conducted major repricing of loans at the end of 2010, and also that following this it geared up to issuing larger loans and many more loans of all sizes.